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Abstract 



Q ■ Through a series of exact mappings we reinterpret the Bernoulh model of sequence aUgnment in 

^ . terms of the discrete-time totahy asymmetric exclusion process with backward sequential update 

j^ ^ and step function initial condition. Using earlier results from the Bethe ansatz we obtain analyti- 

-)— > . 

"^ ■ cally the exact distribution of the length of the longest common subsequence of two sequences of 

-)— > 

rH . finite lengths X,Y. Asymptotic analysis adapted from random matrix theory allows us to derive 

1-^ ! the thermodynamic limit directly from the finite-size result. 
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INTRODUCTION 

Sequence alignment deals with the problem of identifying similarities between two differ- 
ent sequences of objects, represented by "letters" from some "alphabet". This problem has a 
long history in combinatorics and in probability theory where one wishes to find the longest 



fl. 



More 



common subsequence (LCS for short) between two random sequences of letters 
recently, sequence alignment has become a central notion in evolutionary biology where it 
is used to probe functional, structural or evolutionary relationships between DNA or RNA 
strands or proteins _^. In this setting one wishes to quantify how "close" two sequences of 
genetic information are by identifying the LCS of the same gene in different species. 

Given a pair of fixed sequences of c letters of lengths X and Y, the length of their LCS 



is defined by the recursion 



ik 



Lx,Y = max[Lx-i,y, Lx,y-i, Lx-i,y-i + r]x,Y] (1) 

with the boundary conditions Lj^q = Lqj = Lo,o = for all i,j > 0. The variable rix,Y is 
1 if the letters at the positions X and Y match each other, and if they do not. If one 
ignores the correlations between different r]x,Y, and takes them from the bimodal distribu- 
tion F{r]) = p(5^ 1 + (1 — p)(5^,0) one gets the Bernoulli matching (BM) model of sequence 



a 



alignment [j]. To get a model closest to the original LCS problem, one has to put p = 1/c. 
In the thermodynamic limit of infinitely long sequences this problem has been studied in 
some detail. With X = xN, Y = yN, Seppalainen derived rigorously the law of large num- 
bers limit. Asymptotically the quantity Lxy /N is a random variable converging a.s. to a 
function oi p,x,y which he computed explicitly |5|. Using an exact mapping to a directed 
polymer problem, complemented with scaling arguments, it was shown more recently 6| 
that asymptotically the quantity Lx,y is a random variable of the form 

Lx,Y ""-^ 7p{x, y)N + <5p(x, y)N^'\ (2) 

where 'jp{x,y),6p{x,y) are known scale factors and x is a random variable drawn from 
the Tracy- Widom distribution of the largest eigenvalue of CUE random matrices [13]. In 
subsequent work [7] some related quantities were obtained for the thermodynamic limit using 
a mapping to a 5-vertex model and applying the Bethe ansatz. 

In this paper we compute analytically the exact distribution of Lx,y for finite sequences 
by a mapping of the BM problem to a stochastic exclusion process. The mapping of the 



sequence alignment problem onto the asymmetric exclusion process has been proposed in 
The hopping dynamic considered in 8j is the asymmetric exclusion process with sublattice- 
parallel update which admits a transfer-matrix formulation and diagonalization of the matrix 
for finite X and Y . Since our interest lies in an analytical solution for arbitrary X and Y , we 
choose another mapping onto a discrete-time fragmentation process which is equivalent to a 
totally asymmetric simple exclusion process with backward sequential update [lO|, ll6|. This 
allows us to use earlier results obtained directly from Bethe ansatz [13] for this stochastic 
lattice gas model. Specifically, we will express the probability that the length of the LCS is 
at most Q by the probability that the number of jumps of a selected particle in the exclusion 
process up to time Y is at least X — Q. We also outline how ([2]) arises in the thermodynamic 
limit from the result for finite sequences. 



MAPPING OF THE BM MODEL TO AN EXCLUSION PROCESS 

In Fig. [T](a) we illustrate the LCS problem in matrix form for two sequences of lengths 
X = 7 and F = 10 from an alphabet of the four letters A, C, G, T used in DNA sequencing. 
The vertical sequence is read from bottom to top, the horizontal sequence is read from left 
to right. Whenever two letters match there is a bold face 1 in the matrix. The recursion ([1]) 
(its solution is shown in small blue numbers in the matrix) generates a terrace-like structure 
(red lines) where the number of terraces is Lx,y- The boundary condition of the recursion 
amounts to assigning value to the boxes containing the letters of the two sequences and 
also to the empty lower left starting corner. Solving the recursion ([1]), one can note that 
blue numbers in Fig. [T](a) appear with different statistical weights. Indeed, numbers at left 
corners of terraces appear when 'r]x,Y = 1 and have weight p. All the rest of numbers at 
edges of terraces do not depend on f]x,Y and have therefore weight 1. All remaining numbers 
appear when rjxx = having weight (1 — p). It is useful to view the grid that defines the 
matching matrix as a square lattice with X x Y bulk sites, embedded in the rectangle of 
size (X + 1) X (y + 1). Each square (defining the dual square lattice) is labelled {i,j) with 
< i < X and < j < Y. Due to the terraces, each site can take one of five different 
states. It may be (i) traversed horizontally or vertically by a (red) terrace line (ii) represent 
a left or right corner of a terrace, or (iii) be empty. 

By construction, in the BM model each empty site has weight q = 1 —p, each left corner 
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FIG. 1: (a) Matrix representation of the LCS problem in sequence alignment for two sequences 
of lengths y = 10 (vertical, bottom to top) and X = 7 (horizontal, left to right). Matches are 
denoted by a bold face 1 in the matrix. The small blue integers are the solution of the recursion 
([1]) with boundary conditions Lqj = -Lt,o = (not shown in (a), but in (b)). The red lines are 
the level lines that separate terraces of different height. The dashed line follows the LCS CCATG 
of length 5. (b) Mapping to the five vertex model obtained by interchanging the colours of the 
vertical lines and identifying lines with arrows as shown in Fig. [2I|a). Vertex weights p are marked 
by an x, weights 1 — p are marked by a bullet. 



of each line has weight p, and all remaining sites have weight 1. This property allows for a 
mapping to a five- vertex model, see Fig. [T](b). For reasons that become apparent below we 
define this mapping slightly differently from [7] by interchanging the colour of all vertical 
lines. The resulting pattern of intersecting black and red lines then becomes isomorphic to 
the pattern of in- and outging arrows in the five-vertex model with vertex weights given 
by the weights of the BM model. One simply identifies black (red) horizontal lines with 
right-pointing (left-pointing) arrows and black (red) vertical lines with up-pointing (down- 
pointing) arrows as shown in Fig. 2(a). 

A similar terrace-like structure appears in the anisotropic 3D directed percolation model 
(ADP) solved by Rajesh and Dhar 12]. A difference is that levels lines in the ADP can 
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FIG. 2: (a) Mapping of line intersection to vertices of the six-vertex model which is effectively a 
five-vertex model since one of the vertex weights is zero, (b) A way to avoid line intersections in 
the five vertex model: if a horizontal line has a left adjacent vertical line below and a right vertical 
line above, it is replaced by the diagonal shortcut. 

overlap. The overlapping lines can be separated by successive shifts of terraces, then one gets 
the five-vertex model with the same vertex weights as described above. However, the shifts 
destroy the domain-wall boundary conditions which are essential for thw exaxt solution of 
the BM model. Then, the analogy between the ADP and the BM models can be used only 
in the thermodynamic limit as it was demonstrated in 6J. 

It is useful to consider a further mapping of this 5-vertex model onto a discrete-time 
stochastic process, considering the vertical direction as time and horizontal one as discrete 
space. To this end, we first turn the red vertex lines (arrows pointing left or down) into non- 
intersecting particle world lines by replacing a right-left turn with a diagonal "shortcut" , as 
shown in Fig. Mjo). After a space reflection i ^ i' = 1 — i this yields a non-intersecting line 
ensemble as shown in Fig. 3. 

A final mapping is aimed to obtain the line ensemble of particle world lines of the discrete- 
time totally asymmetric exclusion process (TASEP) with the backward sequential update, 
introduced in 10|] and solved in 16|. To this end, we consider each trajectory in Fig. [3] and 
replace each move upward by a diagonal move right and each diagonal move left by a move 
upward (Fig. Hj). In a more formal way, we consider a new square lattice {i' + j + 1/2, j) 
and draw new trajectories using the correspondence {i' + 1/2, j) -^ {i' + j + 1/2, j), see 
Fig. 4. The sites of the new lattice are denoted by coordinates {k, t) numbered by integers 
k = i' + j and t = j. By construction the red lines move upward or diagonally and define the 
world lines of exclusion particles which jump only to the right. The vertex weights assign 
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FIG. 3: Mapping of the five-vertex model to non-intersecting worldlines after space reflection 
i — > 1 — z. The green worldline is the first line (seen from the right) which does not reach the top 
of the grid. 

the appropriate probability to each path ensemble. 

The backward sequential dynamics encoded in the vertex weights may be described as 
follows. In each time particle position are updated sequentially from right to left, starting 
from the rightmost particle. Each step of a particle by one lattice unit in positive direction 
has probability I — p, provided the neighbouring target site is empty. If the target site is 
occupied, the jump attempt is rejected with probability 1. No backward moves are allowed, 
making the exclusion process totally asjTiimetric. The horizontal boundary condition of 
the original sequence matching problem maps into an initial condition where at time t = 
particles occupy consecutive dual lattice points —X + 1 < A; < 0. Since the motion of a 
particle is not influenced by any particles to its left, we may extend the lattice to minus 
infinity. The vertical boundary condition is equivalent to extending the lattice to plus 
infinity, such that at time t = all sites k > are vacant. Thus one has a TASEP on an 
initially half-filled infinite lattice with step initial state. However, only the first X particles 
contribute to the statistical properties of the BM model. 
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FIG. 4: Worldlines of the TASEP. The space-time grid of the exclusion process is the square lattice 
{j! + J + 1/2, j) with new coordinates (A;,t) numbered by integers k = i' + j and t = j. The 
green worldline is the first line (seen from the right) which does not reach the target position 
{k, t) = {Y -X + 1,Y) = (4, 10) that yields Lx,y ior X = 7,Y = 10. 

In the exclusion picture the terrace height has a simple probabilistic interpretation. It 
counts the number of world lines that intersects with a diagonal in the square lattice starting 
from the point {k, t) = {—x, 0) (the left dotted line in Fig. HI Hence, at any given time step, 
the terrace increases at each site from right to left by one unit, unless a world line has been 
crossed when going from right to left. Therefore the number n of trajectories ending at time 
t = Y and the length Lxy of the LCS of the BM model on the rectangle (X + 1) x (F + 1) 
are related by Lx,y = X — n. 

PROBABILITY DISTRIBUTION OF THE LCS FOR FINITE SEQUENCES 



Our aim is the evaluation of the probability distribution A^y = Prob[Lx,y = Q] of the 
Bernoulli model. Having the TASEP interpretation of the original model, we need to evaluate 
an appropriate sum over end points of trajectories of particles. To do this, we select the first 
trajectory (counted from the right) which does not end at time Y in the target range of the 
dual lattice given by the top row (i, Y with I < i < X (the green line in Fig. 4). An important 
observation is that the sum of weights of all trajectories ending at times Ti, T2, . . .T^ < Y 



(all lines to the left of the green line) is 1 for the conservation of probabilities in the TASEP. 
Then, the distribution A^y is the sum over the probabilities of all trajectories with end points 
right of the green line and over end points of the green line itself. Hence of all X particles 
only the rightmost n + 1 = X — Q + l particles are relevant for the computation of A^y . The 
initial positions of these particles are ki = —X + Q,k2 = ki + Ijk^ = k2 + 1, ■ ■ ■ , ^n+i = 0. 
Following the relation between terrace height Q and particle trajectories as discussed 
above, we may consider the final positions Xi,X2,--- ,Xn+i at the moment of time Y. By 
the construction, we have Y — X + l<X2<X3<---< Xn+i < Y and Xi < Y — X. We 
first consider Q = X. In this case no particle has reached site Y — X + 1. In particular, 
this implies that the first particle (initially at site 0) has not reached site Y — X + 1. The 
complementary probability for this event is the probability P{Y — X + 1,1, Y) that the first 
particle has jumped at least Y — X + 1 times up to time Y. Hence 

Ajy = l-P(F-X + l,l,y). (3) 

Now consider Q < X. Then A^y is the joint probability that after Y time steps all 
rightmost X — Q particles (located initially on {—X + Q + 1, . . . ,0)) have reached sites 
> Y—X+1 and the next particle (located initially on —X+Q) has not reached site Y—X+1. 
This is equivalent to the joint probability that the particle originally at —X + Q + 1 has 
jumped at least Y ~ Q times and the particle originally at —X + Q has jumped not more 
than Y — Q times. By construction of the process this joint probability may be expressed as 
the statistical weight of all paths where the particle initially at —X + Q + 1 jumps at least 
Y — Q times minus the statistical weight of all paths where the particle initially at —X + Q 
jumps at least Y — Q + 1 times. 

We have come to a known problem of the TASEP statistics [ll|, [l3| . Consider an infinite 
chain, the left half of which is initially occupied by particles while the right half is empty. 
The problem is to find the probability P{M, N, t) that the A^th particle (counted from the 
right) of the infinite cluster hops at least M times up to time t. With this quantity, we 
obtain for the partition function the expression 

A|y = PiY~Q,X-Q,Y)-P{Y-Q + l,X-Q + l,Y) (4) 

for every Q < X. For Q = X, Eq.(jl]) reduces to ([3]). We remark that Eq.([3]) may be viewed 
as incorporated in Eq.(jl]) in agreement with the notion that the transition probability in an 
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exclusion process with no particles (second argument of P for X + Q) is equal to 1 (this is 
the trivial transition from the empty lattice to the empty lattice). 

To derive Eq.(jl]) more formally, let P(xi, X2, . . . , a;„+i; ki,k2, . . . , kn+i\t) = 
p^^(n+i). ij;(n+i)|i) \)Q ^Jig probability that n + 1 particles located initially at ki, ^2, • • • , ^n+i 
will be at Xi, X2, . . . , Xn+i at time t. Then, the partition function A^y can be written as 
00 2:3— 1 

^ly= E ■■■ E E nx("+^^k("+^y) (5) 

x„+i=Y-~Q X2=Y-X+lxi<Y-X 

Given the positions of n particles at X2, ■ ■ ■ ,Xn+i, the sum of conditional probabilities for 
the first particle to reach any position xi < Xi is 

Y, P(xi|a;2,...,a;„+i)= Y. P{^M^'^) = ^ (6) 

X\<X'2 X\<X1 

and 

X2-1 
Y i'(xi|x(")) = 1 - Y ^(a^ilx^")) (7) 

xx<Y-X zi=y-X+l 

The notations in ([6]) and (jTj) mean that probabilities to find a particle at x\ with respect to 
positions of n other particles, in contrast to probabilities P(x; k|t) conditioned with respect 
to the initial conditions. 

Using d?]), we get the partition function in the form 

00 3:3 — 1 00 X2 — 1 

A|r= E ■■■ E nx(");kH)- Y ■■■ E nx("-^^);k(-^)) (8) 

x„+i=y-Q xi=Y-X+\ x„+i=y-Q+l a;i=y-X+l 

where the time argument Y is omitted. The probability P(x; k|t) for the continuous time 
TASEP has been found in [l5| and for the discrete-time TASEP with the backward sequential 



update in [iq : 



P (x; k 1 1) = det I Pi_, (xi - A;, ; t) I (9) 

where 



Fm{n; t) = — dz{p + ^)*(1 - zy^z'^-^ (10) 

2m i|^|=i_o z 

In terms of P{M, N, t), the first sum in ([8]) is P{Y — Q,X ~ Q,Y) and the second one is 
P{Y -Q + 1,X -Q + 1,Y) which gives (H again. 



For the TASEP with parallel update P{M, N, t) was computed by Johansson [llj who 
used combinatorial methods of the theory of symmetric groups. For the present model with 
backward sequential update the solution was obtained by Rakos and Schiitz using the Bethe 
ansatz method [13^. For more general initial conditions, this problem has been solved by 
Nagao and Sasamoto [1^. The expression for P(M, N,t) obtained in [l3| by evaluation of 
sums of P(x; k|t) reads 

p(M,N,t) = " ' ; ; Y. (11 n fe-'=)(i-«)''}n(«--'.)Mii) 

LLj=lJ-\-^^^ J)- ti,t2,.-,tjv=0 j=l fc=0 i<j 



where q = 1 — p is the jump probability considered in [13|. With ([3]) and (111) this gives 

the exact distribution of the LCS in the Bernoulli matching problem. The cumulative 

distribution 

Q 
Ely = FT0h[Lx,y <Q] = Y1 ^x,y (12) 

M=0 

takes the simple form 

E%y = PiY-Q,X-Q,Y) (13) 



with the natural convention that P{Y, 0, Y) = 1. In 13j it was shown explicitly that P{Y — 
Q,X — Q,Y) = P{X — Q,Y — Q,X) which for the BM model is expected by symmetry. 
The result (1131) provides a simple relation between the cumulative distribution of the length 
of the LCS in the BM model and the distribution of the time-integrated current in the 
backward-sequential TASEP for the step function initial condition. The probability that 
the length of the LCS is at most Q is given by the probability that the number of jumps 
across bond F — X up to time Y is at least X — Q. 

THERMODYNAMIC LIMIT 

We now turn to a brief discussion how the asymptotic results ([2]) of |5| and [a] follow 
from the cumulative distribution (TT3|) . To this end we need the asymptotic properties of 
P{M, N, t) for large arguments. For parallel u pda te Johansson has derived the asyrnptotics 
using results from the random matrix theory [11']. By a transformation proved in 13| this 



yields the asymptotics of the distribution P{M, N, t) for the discrete time TASEP with the 
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backward sequential update. One finds 

lim P([7iV], N, iV^(7, q) + N^^Ml, ?)x) = Fgue{x) (14) 

with 

^(7,g)= ^_^ (15) 

and 

p^/^ (V^+VTmi+vfT)^ .... 

^(^' ^) = 77^ r^^ (1^) 

The function Fg(7£;(x) is the Tracy- Widom distribution of the Gaussian unitary ensemble 



|17|. 



The form of (iMl) indicates that given A^ and 7, the non-trivial scaling regime in time 
is given by the third argument. In the present case, t = Y and 7 are fixed and we search 
for the scaling regime of Q. In our notations, M = Y — Q, N = X — Q,t = Y and 

7 = (F - Q)/{X - Q). From Eq.([T4D we have 

Yr^iX- Q)^(7, q) + {X- g)i/V(7, q)x (17) 

To find the asymptotics of Q for large X and Y, we represent it in the form 

Q{X, Y, q) = Qo{X, Y, q) + i?(X, F, q)x (18) 

Then, the leading term Qo{X, Y, q) can be found from the equation 

2 



y_ x-Qo(x,y,g) / v/y-go(x,y,g) , 

1-P V VX-Qo(X,r,g) ' 



which gives 



Q„(A-.r.,)^ ^v^-^(-'^ + ^-) (20) 

1 — P 

which coincides with the expression found by Seppalainen |5| using probabilistic methods. 
The substution of Eq. lTTS!) with Eq. (!20|) into Eq. (TT7I) leads to the equation for -R(X, y, q) 
which can be resolved in the leading order of x- I^^ the first order, we substitute Q = Qo 
into the second term of the RHS of Eq. (TT7I) to get 

(X - QoY^Mlo, q) = i-y= ^^ =-1 ''' (21) 



(VX - ,/pY)WY - VpX) 
11 



where 

"° = r^Q^ (22) 

The coefficient ci at -R(X, y)x in the expansion of the ffist term of the RHS of Eq. (TT7|) is 

(i-p)v/xy 

""' (v^-v^)(v^-v^) ^ ' 

Then, R{X, Y, q) is the ratio (X — (5o)^^'^cr(7o5 Q')/ci and we obtain 

vxy/ 1 — p 

Both results Eg. (1201) and Eq.( l24l) coincide with corresponding expressions obtained in 61] 



from a comparison with Johansson's result for the directed polymer problem 



|ll|. 



CONCLUSIONS 

We have considered the Bernoulli matching model of sequence alignment with the aim of 
deriving the exact probability that longest common subsequence of two sequences of finite 
lengths X, Y has length N . By a series of mappings we have transformed the matching prob- 
lem to the time evolution of the totally asymmetric simple exclusion process with backward 
sequential update in a step-function initial state. In this mapping the computation of the 
probability distribution turns into the distribution of the time-integrated current through 
a certain bond. This problem has been solved by Rakos and Schiitz [l3| by Bethe ansatz 
methods. Thus the desired result for the Bernoulli matching model for finite sequences has 
been obtained in explicit form (IT3|) through some coordinate transformations from the Bethe 
ansatz. 

In the thermodynamic limit we recover the earlier results of Majumdar and Nechaev 
Gj through an asymptotic analysis where we use the fact that in the thermodynamic limit 
there is a scaling form of the current distribution found by Johansson [11] which involves the 
distribution of the largest eigenvalue of the GUE ensemble of random matrices. Adapting 
this result to the present setting requires again some nontrivial coordinate transformation. 
We find the occurrence of an eigenvalue distribution of random matrices (which is valid also 
for finite sequence lengths, but for the Laguerre ensemble) intriguing. 
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